Let's look at the variables in our data and their distributions. Below are the structure and summary of the dataset.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
We have 4898 observations of 12 variables: 11 measured characteristics and 1 score based on sensory data, plus the index (X).
How clean is our data? Are there any duplicates?
##
## FALSE TRUE
## 3961 937
Yes, we have 937 duplicates. Are those entries real data? Two wines could have exactly the same values in all the variables, including the quality, which would confirm the quality score given by the experts (this quality score is the median of at least 3 evaluations made by wine experts).
Are there wines with the same measured values but a different quality? How consistent is the sensory variable?
##
## FALSE TRUE
## 3961 937
The quality value seems consistent: the count of duplicates (937) is the same as in the previous check. Without more information about the dataset, we cannot discard the duplicates, but we will look at their values later.
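The two checks above can be sketched with base R's `duplicated()`. This is only an illustration on a made-up data frame (`toy` and its values are assumptions, not the wine data), and the report's own code may differ.

```r
# Toy data frame standing in for the wine data (values are made up).
toy <- data.frame(
  X       = 1:5,
  acidity = c(7.0, 6.3, 7.0, 6.3, 8.1),
  sugar   = c(20.7, 1.6, 20.7, 1.6, 6.9),
  quality = c(6, 6, 6, 5, 6)
)

# Rows that repeat an earlier row in every column except the index X.
dup_all <- duplicated(toy[, names(toy) != "X"])
table(dup_all)

# Rows that repeat an earlier row in the measured columns only
# (index and quality both dropped).
dup_measured <- duplicated(toy[, !(names(toy) %in% c("X", "quality"))])
table(dup_measured)

# Rows whose measurements were seen before but with a different quality:
# a non-zero count would point to inconsistent sensory scores.
inconsistent <- sum(dup_measured & !dup_all)
inconsistent
```

When both tables report the same number of duplicates, as they do in the report, the last count is zero, which is what "consistent" means here.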
The first variable we are curious about is quality, so we look at it in a histogram.
As described in the dataset documentation, it is quite unbalanced: most of the wines score 6, only a few are of poor or really high quality, and no wine has a quality lower than 3 or higher than 9.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
We also want to know how many of the wines are sweet, so we analyze the residual sugar variable.
Most of the wines have less than 20 g/l of residual sugar, which means they are mostly dry wines. But there are some sweet wines too.
We have 5 high sugar values, but only one seems to qualify as "sweet" (over 45 g/l of residual sugar). We check that programmatically: how many of the wines have a residual.sugar higher than 45 g/l?
##
## FALSE TRUE
## 4897 1
We have only one sweet wine in the data. Should we discard it as an outlier for the sake of this study? Comparing dry and sweet wines could be like mixing apples and oranges for some experts, but since we are talking about white wines in general, we will keep this value. We take a closer look at the residual.sugar variable:
The distribution is right-skewed, with a peak between 0 and 2.5 g/l of residual sugar.
With a log transformation of the previous plot, we can see that the distribution of residual sugar is roughly multimodal, with peaks around 1.5, 8 and 13 g/l and minima around 3.5 and 9.5 g/l.
In order to compare the sweetness of the wine with other variables later in this study, we want both a qualitative and a quantitative variable, so we create a new column using the wine sweetness classification found at winefolly.com (see references).
Here we have the distribution of this new variable.
##
## dry off-dry semi-sweet medium sweet very sweet
## 3531 1284 82 1 0
Next we calculate the share of dry and off-dry wines in the whole dataset:
## [1] 0.9830543
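A qualitative sweetness column of this kind can be built with `cut()`. The sketch below uses illustrative sugar values and hypothetical break points (the report does not list the thresholds it took from winefolly.com), so only the mechanics should be read from it.

```r
# Illustrative residual sugar values in g/l (not the real data).
sugar <- c(1.2, 5.0, 12.0, 31.6, 65.8)

# Hypothetical break points: the report does not state the ones it used.
breaks <- c(0, 10, 30, 50, 120, Inf)
labels <- c("dry", "off-dry", "semi-sweet", "medium sweet", "very sweet")

# right = FALSE makes each interval closed on the left: [0, 10), [10, 30), ...
sweet_level <- cut(sugar, breaks = breaks, labels = labels, right = FALSE)
table(sweet_level)

# Share of wines that are dry or off-dry.
dry_share <- mean(sweet_level %in% c("dry", "off-dry"))
dry_share
```

`table()` keeps empty levels, which is why a class such as "very sweet" still appears with a count of 0.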
The pH of white wines is normally between 3 and 4; in our sample, half of the wines lie between 3.09 and 3.28 (the interquartile range), with extremes of 2.72 and 3.82. "Vinho verde" is quite a young wine, which means it is quite acidic, and the data agree with that.
Tartaric, acetic and citric acids seem to have several outliers, so let's zoom in on the most common values to see their distributions better:
The distribution of the most common values for the acids and pH is approximately normal. Citric acid, however, has two unexpected peaks:
One is at 0.49, and the other, much smaller, at 0.74.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
There are 215 wines with 0.49 g/l of citric acid, and 41 with 0.74 g/l. Could these values be related to the duplicates we saw before?
## dup_citric
## 0 0.01 0.04 0.06 0.07 0.09 0.1 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19
## 1 1 2 1 1 1 1 4 3 6 5 6 8 10 8
## 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34
## 14 11 21 15 38 24 46 52 62 44 68 38 43 28 44
## 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49
## 23 41 23 22 21 16 14 21 5 12 7 3 3 5 42
## 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.61 0.62 0.63 0.64 0.65 0.66
## 5 6 5 2 5 3 4 3 5 4 4 1 1 3 3
## 0.67 0.68 0.69 0.71 0.73 0.74 0.8 0.91
## 3 2 2 2 1 7 1 1
We can see that 42 of the duplicates have 0.49 g/l and 7 of them 0.74 g/l, so even if we removed the duplicates we would still have peaks at those two values (173 and 34 wines, respectively). We cannot say they are wrong values caused by the duplicates.
We also create a qualitative column from the pH values, labelling the wines in the lower quartile as more acidic and the wines in the upper quartile as more basic. The wines within the interquartile range are considered medium.
##
## more acidic medium more basic
## 1314 2402 1182
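The quartile-based labelling can be sketched with `quantile()` and `cut()`. The pH values below are simulated and the handling of values that fall exactly on a quartile is an assumption; the report's own code may treat ties differently.

```r
set.seed(1)
# Simulated pH values rounded to two decimals, like the real measurements.
pH <- round(rnorm(200, mean = 3.19, sd = 0.15), 2)

# First and third quartiles.
q <- unname(quantile(pH, probs = c(0.25, 0.75)))

# Intervals are (-Inf, Q1], (Q1, Q3], (Q3, Inf): a value equal to a
# quartile falls into the lower of the two neighbouring groups.
ph_level <- cut(
  pH,
  breaks = c(-Inf, q[1], q[2], Inf),
  labels = c("more acidic", "medium", "more basic")
)
table(ph_level)
```

Because many wines share the same rounded pH, the groups are not exactly 25%, 50% and 25% of the data, which matches the uneven counts in the table above.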
Sulphates are added to the wine and can contribute to the SO2, and the total SO2 includes the free SO2, so these three variables should be somehow correlated (we'll check that later). The sulphates distribution looks slightly right-skewed, while the free and total SO2 distributions look approximately normal with some outliers.
From the quantitative sulphates column we create a qualitative variable called sulf_level that we will use in the multivariate study. The ranges are based on its quartiles.
##
## lower low high higher
## 1321 1133 1294 1150
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
We can see that most of the alcohol values are either integers or have one decimal digit; only a few have more than one decimal digit.
Let's create a new column for the alcohol level to study this variable in more depth (classification based on winefolly.com).
The chloride distribution is positively skewed, with most of the values between 0 and 0.1. Let's zoom in:
##
## FALSE TRUE
## 110 4788
## [1] 97.75419
About 97.8% of the wines have a chloride value lower than 0.1 g/l (4788 entries out of 4898).
##
## FALSE TRUE
## 4895 3
Only 3 wines have a density over 1.005 g/cm3.
The distribution without the three outliers is approximately normal. Could those outliers be measurement errors? Given that density varies by only hundredths of a gram per cm3, they could be real values.
There are 4898 observations of white wines and 12 features:
- tartaric acid (fixed.acidity)
- acetic acid (volatile.acidity)
- citric acid (citric.acid)
- residual sugar (residual.sugar)
- salt (chlorides)
- free SO2 (free.sulfur.dioxide)
- total SO2 (total.sulfur.dioxide)
- density
- pH
- sulphates
- alcohol
- quality
All of them except quality are measured on the 4898 wine samples; quality is based on sensory data (perception: the median of at least 3 evaluations made by wine experts).
The dataset is quite unbalanced: most of the wines score medium values for quality, with just a few considered of poor or high quality. Comparing variables and finding relations between them can therefore be a little harder, since we do not properly cover the whole spectrum of quality scores (0 to 10).
98.3% of the samples are either dry or off-dry wines, as opposed to sweet wines.
As expected for this type of wine, the samples show quite acidic pH levels.
Free and total SO2 are related since one contains the other.
Most of the wines show low or medium-low levels of alcohol (less than 11.5%).
Most of the wines are low in salt and have a density below 1 g/cm3.
According to my research, a perfect wine has a great balance between acidity, tannin, alcohol and sweetness. So I guess that pH, the three acids, alcohol and residual sugar will be, together with quality, the most interesting features. I am trying to work out which of the variables are most meaningful for the quality of the wine... it will probably be a combination of them.
I will take a look at the sulfur dioxides (free and total) as well as the sulphates to see how they affect quality.
I created columns for wine types according to sugar, alcohol and pH, so I could compare the different types. I used the classifications found at http://winefolly.com/
The distribution of residual sugar appears to be multimodal once log-transformed, with peaks around 1.5, 8 and 13 g/l and minima around 3.5 and 9.5 g/l. Regarding alcohol, most of the values are either integers or have one decimal digit; only a few have more than one decimal digit. The citric acid distribution shows an unexpected peak at 0.49 g/l and a secondary peak at 0.74 g/l.
First, let's look at the correlation matrix. We need to drop the index (which carries no information) and all the qualitative variables.
## tartaric acetic citric sugar salt f_SO2 t_SO2 dens pH sulph
## tartaric 1.00 -0.02 0.29 0.09 0.02 -0.05 0.09 0.27 -0.43 -0.02
## acetic -0.02 1.00 -0.15 0.06 0.07 -0.10 0.09 0.03 -0.03 -0.04
## citric 0.29 -0.15 1.00 0.09 0.11 0.09 0.12 0.15 -0.16 0.06
## sugar 0.09 0.06 0.09 1.00 0.09 0.30 0.40 0.84 -0.19 -0.03
## salt 0.02 0.07 0.11 0.09 1.00 0.10 0.20 0.26 -0.09 0.02
## f_SO2 -0.05 -0.10 0.09 0.30 0.10 1.00 0.62 0.29 0.00 0.06
## t_SO2 0.09 0.09 0.12 0.40 0.20 0.62 1.00 0.53 0.00 0.13
## dens 0.27 0.03 0.15 0.84 0.26 0.29 0.53 1.00 -0.09 0.07
## pH -0.43 -0.03 -0.16 -0.19 -0.09 0.00 0.00 -0.09 1.00 0.16
## sulph -0.02 -0.04 0.06 -0.03 0.02 0.06 0.13 0.07 0.16 1.00
## alc -0.12 0.07 -0.08 -0.45 -0.36 -0.25 -0.45 -0.78 0.12 -0.02
## qual -0.11 -0.19 -0.01 -0.10 -0.21 0.01 -0.17 -0.31 0.10 0.05
## alc qual
## tartaric -0.12 -0.11
## acetic 0.07 -0.19
## citric -0.08 -0.01
## sugar -0.45 -0.10
## salt -0.36 -0.21
## f_SO2 -0.25 0.01
## t_SO2 -0.45 -0.17
## dens -0.78 -0.31
## pH 0.12 0.10
## sulph -0.02 0.05
## alc 1.00 0.44
## qual 0.44 1.00
And what about the correlation matrix if we remove the duplicates?
## tartaric acetic citric sugar salt f_SO2 t_SO2 dens pH sulph
## tartaric 1.00 -0.02 0.30 0.08 0.02 -0.06 0.08 0.27 -0.43 -0.02
## acetic -0.02 1.00 -0.16 0.10 0.09 -0.10 0.10 0.06 -0.05 -0.02
## citric 0.30 -0.16 1.00 0.11 0.13 0.09 0.12 0.16 -0.18 0.05
## sugar 0.08 0.10 0.11 1.00 0.08 0.31 0.41 0.82 -0.17 -0.02
## salt 0.02 0.09 0.13 0.08 1.00 0.10 0.19 0.25 -0.09 0.02
## f_SO2 -0.06 -0.10 0.09 0.31 0.10 1.00 0.62 0.29 -0.01 0.04
## t_SO2 0.08 0.10 0.12 0.41 0.19 0.62 1.00 0.54 0.01 0.14
## dens 0.27 0.06 0.16 0.82 0.25 0.29 0.54 1.00 -0.06 0.08
## pH -0.43 -0.05 -0.18 -0.17 -0.09 -0.01 0.01 -0.06 1.00 0.14
## sulph -0.02 -0.02 0.05 -0.02 0.02 0.04 0.14 0.08 0.14 1.00
## alc -0.11 0.05 -0.08 -0.40 -0.36 -0.25 -0.45 -0.76 0.09 -0.02
## qual -0.12 -0.19 0.01 -0.12 -0.22 0.01 -0.18 -0.34 0.12 0.05
## alc qual
## tartaric -0.11 -0.12
## acetic 0.05 -0.19
## citric -0.08 0.01
## sugar -0.40 -0.12
## salt -0.36 -0.22
## f_SO2 -0.25 0.01
## t_SO2 -0.45 -0.18
## dens -0.76 -0.34
## pH 0.09 0.12
## sulph -0.02 0.05
## alc 1.00 0.46
## qual 0.46 1.00
We do not see a large change in the correlation results, so it does not make sense to remove the duplicates, given that they could be real values.
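As a rough sketch of how two such matrices can be produced, the code below builds a simulated data frame (`toy`, with made-up coefficients and ten rows deliberately repeated), keeps only the numeric columns, drops the index and calls `cor()` with and without the repeated rows. None of the names or numbers come from the report.

```r
set.seed(42)
n <- 100
toy <- data.frame(
  X       = seq_len(n),
  sugar   = runif(n, 1, 20),
  alcohol = runif(n, 8, 14)
)
# Simulated density that rises with sugar and falls with alcohol.
toy$density <- 0.99 + 0.0005 * toy$sugar -
  0.001 * (toy$alcohol - 10) + rnorm(n, sd = 0.0005)
# A qualitative column that must be excluded from the correlation.
toy$level <- cut(toy$alcohol, breaks = c(-Inf, 11, Inf),
                 labels = c("low", "high"))
# Append ten repeated rows to mimic duplicated wines.
toy <- rbind(toy, toy[1:10, ])

# Keep numeric columns only, then drop the index.
num <- toy[, sapply(toy, is.numeric)]
num <- num[, names(num) != "X"]

cm_all <- cor(num)
round(cm_all, 2)

# Same matrix after removing rows that repeat all measurements.
num_unique <- num[!duplicated(num), ]
cm_unique <- cor(num_unique)
round(cm_unique, 2)
```

Comparing `cm_all` and `cm_unique` entry by entry, for example with `round(cm_all - cm_unique, 2)`, makes small shifts like the ones seen above easier to spot.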
Our first goal is understanding how quality relates to the other variables, but no variable is strongly correlated with quality; only alcohol is moderately positively correlated with the quality of the wine (0.44). The highest correlations are:
As expected, we cannot see a strong correlation here, but we can see a positive trend. Let's use the alc_level column to study this relationship further.
We can see that the median quality of the wine is higher with a higher level of alcohol which agrees with the moderate positive correlation between the two variables.
##
## Call:
## lm(formula = quality ~ alcohol, data = wdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582009 0.098008 26.34 <2e-16 ***
## alcohol 0.313469 0.009258 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
Only 19 percent of the variance in quality is explained by the alcohol content (R-squared of 0.1897).
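For a regression with a single predictor and an intercept, R-squared equals the squared Pearson correlation, which is why a correlation of about 0.44 goes with an R-squared of about 0.19. A minimal sketch on simulated data (the variable values and coefficients are made up for illustration):

```r
set.seed(7)
# Simulated alcohol and quality values, not the wine data.
alcohol <- runif(300, 8, 14)
quality <- 2.5 + 0.3 * alcohol + rnorm(300, sd = 0.8)

fit <- lm(quality ~ alcohol)
r2  <- summary(fit)$r.squared   # R-squared reported by the model
r   <- cor(alcohol, quality)    # Pearson correlation

# The squared correlation and the model R-squared coincide.
c(r = r, r_squared = r^2, model_R2 = r2)
```

The same identity links the 0.84 correlation between density and residual sugar to the 70% of variance reported for that model.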
##
## dry off-dry semi-sweet medium sweet very sweet
## 3531 1284 82 1 0
## wdata$sweet_level: dry
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.927 6.000 9.000
## --------------------------------------------------------
## wdata$sweet_level: off-dry
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.77 6.00 9.00
## --------------------------------------------------------
## wdata$sweet_level: semi-sweet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.000 5.000 5.000 5.439 6.000 7.000
## --------------------------------------------------------
## wdata$sweet_level: medium sweet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 6 6 6 6 6
## --------------------------------------------------------
## wdata$sweet_level: very sweet
## NULL
We cannot really say much about the quality based on the sweetness of the wine.
We can see the linear correlation between these two variables, which also holds for the outliers. Seeing that the samples with high residual sugar also have high density makes us think that they are not errors but real values from sweeter wines.
Let's zoom in a little so we can see the line better.
##
## Call:
## lm(formula = density ~ residual.sugar, data = wdata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0056862 -0.0011059 0.0001726 0.0011523 0.0155617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.909e-01 3.742e-05 26480.7 <2e-16 ***
## residual.sugar 4.947e-04 4.586e-06 107.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001628 on 4896 degrees of freedom
## Multiple R-squared: 0.7039, Adjusted R-squared: 0.7038
## F-statistic: 1.164e+04 on 1 and 4896 DF, p-value: < 2.2e-16
As per R^2, 70% of the variance in density can be explained by the values of residual.sugar.
In this case we can see the negative linear correlation, but the density outliers (in red in the first graph) don't seem to follow the same relationship. They have high density and not particularly low alcohol.
These are the same outliers we saw in the residual sugar vs. density plot. Let's have a look at them:
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654 7.9 0.330 0.28 31.6
## 1664 1664 7.9 0.330 0.28 31.6
## 2782 2782 7.8 0.965 0.60 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1654 0.053 35 176 1.01030 3.15
## 1664 0.053 35 176 1.01030 3.15
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality sweet_level ph_level sulf_level
## 1654 0.38 8.8 6 semi-sweet medium lower
## 1664 0.38 8.8 6 semi-sweet medium lower
## 2782 0.69 11.7 6 medium sweet more basic higher
## alc_level
## 1654 low
## 1664 low
## 2782 medium
There are three outliers (painted in red in the first scatter plot; two of them are duplicates). Checking the values of these outliers for the other variables, we can see that the highest-density wine also has an outstandingly high volatile acidity (acetic acid).
There is a moderate negative correlation between sugar and alcohol (-0.45), but the plots don't help much in studying it.
Let's confirm that the residual sugar outliers are the same as the density ones.
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654 7.9 0.330 0.28 31.6
## 1664 1664 7.9 0.330 0.28 31.6
## 2782 2782 7.8 0.965 0.60 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1654 0.053 35 176 1.01030 3.15
## 1664 0.053 35 176 1.01030 3.15
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality sweet_level ph_level sulf_level
## 1654 0.38 8.8 6 semi-sweet medium lower
## 1664 0.38 8.8 6 semi-sweet medium lower
## 2782 0.69 11.7 6 medium sweet more basic higher
## alc_level
## 1654 low
## 1664 low
## 2782 medium
Yes, as expected, because sugar and density are strongly correlated.
How are the different acids related to each other, and how are they related to quality?
As expected, there is no correlation at all between fixed and volatile acidity.
But pH and tartaric acid (fixed.acidity) are negatively related (-0.43).
Let's zoom in:
Let's now take a closer look at the peaks in citric acid: are wines given those specific values of citric acid in search of better quality?
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 5.000 6.000 5.893 6.000 9.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 5.000 6.000 5.659 6.000 8.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
Here we can see the summary of quality for these special values and for the whole dataset. We do not see any particular change in quality associated with these amounts of citric acid. Without more information, there is no apparent reason for those specific values of citric acid. We can add more variables to the study later.
As expected, these two variables are positively correlated, even for the outliers with high free and total SO2. Let's zoom in.
Let's see if the SO2 outliers are somehow related to the outliers in residual sugar and density:
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 4746 4746 6.1 0.26 0.25 2.9
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 4746 0.047 289 440 0.99314 3.44
## sulphates alcohol quality sweet_level ph_level sulf_level
## 4746 0.64 10.5 3 dry more basic higher
## alc_level
## 4746 medium low
We can see that this value is an outlier for both forms of SO2 and that it has a high value for sulphates. Its values for the other variables are not outstanding, but we should point out that the perceived quality of this wine is among the lowest in the dataset (3). We know that high concentrations of free SO2 can be evident in nose and taste, so in this case the flavour could have affected the quality score. But what about the rest of the dataset? Let's see what happens to other wines with concentrations of free SO2 over 50 ppm (50 mg/l).
##
## FALSE TRUE
## 4030 868
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.915 6.000 9.000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.705 6.000 9.000
It's hard to say that having over 50 ppm of free SO2 affects quality, although the average score in this case is slightly lower than the score for wines with lower concentrations of free SO2. Given the correlation between them (0.01), we cannot say they are linearly related at all.
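A threshold comparison like this one can be sketched with a logical flag and `tapply()`. The six wines below are invented for illustration; only the 50 mg/l cut-off comes from the text.

```r
# Invented free SO2 (mg/l) and quality values for six wines.
free_so2 <- c(20, 35, 48, 55, 70, 289)
quality  <- c(6, 7, 6, 5, 6, 3)

# Flag wines above the 50 mg/l threshold and count each group.
high_so2 <- free_so2 > 50
table(high_so2)

# Quality summary and mean quality per group
# (FALSE = at or below 50, TRUE = above).
tapply(quality, high_so2, summary)
group_means <- tapply(quality, high_so2, mean)
group_means
```

A difference in group means is not the same thing as a linear correlation, so a small gap between the groups can coexist with a correlation coefficient close to zero.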
With a correlation coefficient of 0.01, free SO2 and quality are not really correlated, although the previous plot shows that for really high values of free SO2 the quality seems to be lower.
We can't see any variable that is highly correlated with quality. The strongest correlation of quality is with alcohol, but it is a weak one (0.44: the higher the level of alcohol, the slightly better the quality score), and only 19% of the variance in quality is explained by alcohol.
The strongest correlation in the dataset is found between density and residual sugar (correlation coefficient of 0.84); 70% of the variance in density is explained by the residual sugar. Density is also correlated, but negatively, with alcohol (coefficient of -0.78). In this case we have some outliers that don't follow this relationship, with high density but normal values of alcohol.
Residual sugar is negatively correlated to alcohol (not strongly though: -0.45) and very weakly negatively correlated with quality (-0.10).
Regarding acids, pH and tartaric acid (fixed acidity) are negatively correlated (-0.43).
Despite the fact that high values of acetic acid (volatile acidity) can lead to a vinegar taste, this acid and quality are very weakly correlated (-0.19). Similarly, although high concentrations of free SO2 can be evident in nose and taste, free SO2 is not correlated with the perceived quality (0.01).
Both forms of SO2 (free and total) are positively correlated (0.62), which was expected because total SO2 includes the free form. I did not consider density a main feature of interest before, but seeing its relation with other features I do now. I have talked about those relations in the previous paragraph.
The strongest correlation in the dataset is found between density and residual sugar (correlation coefficient of 0.84), 70% of the variance in density is explained by the residual sugar.
We can see the strong correlation between density and sweetness that we have already discussed, but we can also see a weak negative correlation between quality and density in the plot.
We can see that there is a higher concentration of dry wine points to the right of the plot, due to the negative correlation between alcohol and sugar, but quality doesn't seem to be related to the sweetness of the wine. It is weakly correlated, though, with the alcohol of the wine: for higher alcohol levels, quality seems to be a little better.
We cannot see the medium sweet wine due to overlapping. Where is that point? Even after adjusting the alpha to add transparency, it is hard to see.
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 2782 2782 7.8 0.965 0.6 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality sweet_level ph_level sulf_level
## 2782 0.69 11.7 6 medium sweet more basic higher
## alc_level
## 2782 medium
It is at around 6 in quality (offset by the jitter) and 11.7 in alcohol. If we filter the graph to remove the dry wines:
We can see that the cloud of points is slightly darker to the right of the graph and also in an area around 9% alcohol and 15 g/dm3 of residual sugar. Therefore we can say that in our data, for drier wines, the better quality is found at higher levels of alcohol, but for sweeter wines, low levels of alcohol seem to be a better combination for perceived quality.
We can see quite clearly a relation between residual sugar and alcohol level (the less the sugar, the higher the alcohol level). These variables are not strongly correlated, but the correlation is stronger for a given value of density. We can also see that the lighter dots (those with lower alcohol) are spread over the whole range of density, while the darker ones are more common at lower densities, which was expected due to the negative correlation between density and alcohol.
Let's now take a look at the "sweet-dense outliers":
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1654 1654 7.9 0.330 0.28 31.6
## 1664 1664 7.9 0.330 0.28 31.6
## 2782 2782 7.8 0.965 0.60 65.8
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1654 0.053 35 176 1.01030 3.15
## 1664 0.053 35 176 1.01030 3.15
## 2782 0.074 8 160 1.03898 3.39
## sulphates alcohol quality sweet_level ph_level sulf_level
## 1654 0.38 8.8 6 semi-sweet medium lower
## 1664 0.38 8.8 6 semi-sweet medium lower
## 2782 0.69 11.7 6 medium sweet more basic higher
## alc_level
## 1654 low
## 1664 low
## 2782 medium
We can see in the table that the point with index 2782 in particular (the highest value of residual sugar and density) also has high values in many other variables: volatile acidity, citric acid, pH and sulphates. Those values don't seem to affect the quality, though. We'll look at those outliers with plots:
Quality is six for the sweet outliers, as we can see. If we zoom in, we can also see a weak negative relation between density and quality (as we have already mentioned): the lower the density, the higher the concentration of high-quality points. We can also see that for residual sugar around 15, high-quality wines allow a higher density.
The sweetest outlier has a higher pH (more basic) and quite a high value of citric acid.
The sweetest outlier also has higher values of sulphates and volatile acidity (acetic acid).
Let's now see the relation between the acidities and pH together.
In this graph we can clearly see the relation between fixed acidity and pH level (higher fixed acidity goes with more acidic wines, in line with the -0.43 correlation), but not between volatile acidity and pH level.
Again, if we add citric acid to the graph, we can see a stronger relation between pH level and fixed acidity, whereas the relation between pH level and citric acid is not clear (their correlation coefficient was -0.16).
And now, let's have a look at the sulphates and sulfur dioxides:
We cannot see a relation between the sulphates and the SO2 (sulfur dioxide) values, apart from the positive correlation between the two sulfur dioxides that we have already talked about.
If we wanted to create a linear model, the variable that shows the strongest relations with other variables is density. Zooming in on those relations, we get the next plots:
##
## Calls:
## m1: lm(formula = density ~ residual.sugar, data = wdata)
## m2: lm(formula = density ~ residual.sugar + alcohol, data = wdata)
## m3: lm(formula = density ~ residual.sugar + alcohol + total.sulfur.dioxide,
## data = wdata)
## m4: lm(formula = density ~ residual.sugar + alcohol + total.sulfur.dioxide +
## quality, data = wdata)
##
## ====================================================================================
## m1 m2 m3 m4
## ------------------------------------------------------------------------------------
## (Intercept) 0.991*** 1.005*** 1.003*** 1.004***
## (0.000) (0.000) (0.000) (0.000)
## residual.sugar 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000)
## alcohol -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000)
## total.sulfur.dioxide 0.000*** 0.000***
## (0.000) (0.000)
## quality -0.000***
## (0.000)
## ------------------------------------------------------------------------------------
## R-squared 0.704 0.907 0.911 0.912
## adj. R-squared 0.704 0.907 0.911 0.912
## sigma 0.002 0.001 0.001 0.001
## F 11636.984 23791.076 16738.603 12698.715
## p 0.000 0.000 0.000 0.000
## Log-likelihood 24498.873 27328.019 27448.397 27474.452
## Deviance 0.013 0.004 0.004 0.004
## AIC -48991.747 -54648.037 -54886.794 -54936.903
## BIC -48972.257 -54622.051 -54854.311 -54897.924
## N 4898 4898 4898 4898
## ====================================================================================
We can see (according to the R^2 value) that most of the variance in density can be explained by the residual sugar (70% of the density variance). If we add alcohol and total SO2, we account for 91.1% of the variance. Quality is not a measured variable but one based on sensory data; adding it to the model raises the explained density variance only marginally, to 91.2%.
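The step-by-step comparison can be sketched in base R by growing a model with `update()` and collecting R-squared from each fit. The data below are simulated with made-up coefficients, and the comparison table in the report appears to come from a different formatting function, so this is only an approximation of the workflow.

```r
set.seed(3)
n <- 500
# Simulated predictors and response; coefficients are illustrative only.
sugar   <- runif(n, 1, 20)
alcohol <- 13 - 0.15 * sugar + rnorm(n)
so2     <- 100 + 3 * sugar + rnorm(n, sd = 30)
density <- 1.005 + 0.0004 * sugar - 0.0012 * alcohol +
  0.000005 * so2 + rnorm(n, sd = 0.0005)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(density ~ sugar)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + so2)

# R-squared of each model; it cannot decrease as predictors are added.
r2 <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
             function(m) summary(m)$r.squared)
round(r2, 3)

# Coefficients shown with significant digits instead of fixed decimals.
signif(coef(m3), 3)

# F-tests for whether each added predictor improves the fit.
anova(m1, m2, m3)
```

Printing coefficients with significant digits avoids entries such as 0.000 in the table above, where the residual sugar slope (4.947e-04 in the single-predictor model) is rounded away.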
By plotting the values, we have found more evidence that quality is weakly negatively correlated with density (-0.31): lower-density wines tend to have higher perceived quality, except for wines with a residual sugar around 15, where high-quality wines seem to allow higher densities. We've also found that quality is not really correlated with residual sugar (-0.1) and that quality and alcohol seem to be weakly correlated (0.44): for higher alcohol levels, quality seems to be a little better. Combining more variables, we have seen that in our dataset, for drier wines, the better quality is found at higher levels of alcohol, but for sweeter wines, low levels of alcohol seem to be a better combination for perceived quality.
We have found again the strong correlation between density and sweetness that we saw in the bivariate study, as well as the relation of both with alcohol. Both are negatively related to alcohol, and the correlation is stronger for density and alcohol (-0.78) than for residual sugar and alcohol (-0.45). However, we have seen that for a given density, the negative correlation between residual sugar and alcohol is really strong.
Regarding acids, we have also found the relation between fixed acidity and pH level (higher fixed acidity goes with more acidic wines), but not between volatile acidity and pH level or between citric acid and pH level.
Regarding SO2 and sulphates, we haven't found any relation between them, apart from the positive correlation between the two sulfur dioxides that we have already talked about.
We have studied the "sweet-dense outliers" in depth. The highest of them in particular shows high levels of other features (volatile acidity, citric acid, pH and sulphates), yet the perceived quality of those outliers is just the median (6).
Yes, I created a linear model for the density of the wine. One of the variables included (residual.sugar) accounts for 70% of the density variance, but the model gets better if we add alcohol and total SO2: combined, they account for 91.1% of the density variance. We could also include quality, although it is a perceived (sensory) variable. In that case we would account for 91.2% of the variance in density.
This is a portion of the citric acid distribution, zoomed in to study the unexpected peaks at 0.49 and 0.74 g/dm3. The relation between those peaks and the duplicated values was studied, but only a small portion of those values was due to duplicates; we could not confirm that the values were invalid, so they were included in the study.

### Plot Two
One of the limitations of our data is the scarcity of data points with extreme values: we can see that most of the wines have a score of 6. The number of wines with medium and medium-high levels of alcohol is also low. Still, we can see that there is a positive correlation (weak though, at 0.44) between quality and alcohol level. We can see that in the boxplot, with the median quality being 7 for the medium-high alcohol wines and 5 for the low alcohol wines.
We can see that the cloud of points is slightly darker to the right of the graph and also in an area around 9% alcohol and 15 g/dm3 of residual sugar. Therefore we can say that in our data, for drier wines, the better quality is found at higher levels of alcohol, but for sweeter wines, low levels of alcohol seem to be a better combination for perceived quality.
We have a dataset of 4898 observations of Portuguese white vinho verde. The dataset is quite unbalanced, given that it contains mostly medium quality wines. Furthermore, it contains several duplicates that we couldn't discard, since we couldn't prove they were dataset errors. All of the variables except quality are measured; quality is sensory. There are also several outliers, like a few sweeter wines, which we also studied separately in search of relations with other variables.
We have tried to study what makes a wine better or worse by comparing the quality of the wine with the rest of the variables, but only alcohol seemed to have a relation with quality (not a very strong one, with a correlation coefficient of 0.44). We could also study how some combined variables, like alcohol and residual sugar, affected the perceived quality.
We did find strong relations between other variables, like residual sugar and density, and also density and alcohol. In fact, density was the variable that showed the strongest relations to other variables, so we created a linear model to predict the density of the wine using residual sugar, alcohol and total sulfur dioxide. Adding quality improves the model only marginally.
Future work with this dataset could involve searching for more data to increase the number of entries and reduce the imbalance, in order to get more reliable results. Contacting the authors of the dataset to ask about the duplicates and outliers, to confirm the reliability of the data, would also be a good idea.